[SPARK-32511][SQL] Add dropFields method to Column class #29322
Conversation
Jenkins, test this please.
Jenkins, add to whitelist.
Test build #126944 has finished for PR 29322 at commit
retest this please
Test build #126947 has finished for PR 29322 at commit
Test build #127062 has finished for PR 29322 at commit
Test build #127063 has finished for PR 29322 at commit
Test build #127351 has finished for PR 29322 at commit
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
LGTM except a few minor comments
@cloud-fan thanks for your review! Also, could you remove the
Test build #127390 has finished for PR 29322 at commit
retest this please
The last commit already passed jenkins, I'm merging it to master, thanks!
Test build #127394 has finished for PR 29322 at commit
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/complexTypesSuite.scala
Hi, @cloud-fan. Could you update the Apache Jira issue, SPARK-32511, according to your revert, please?
reopened
What changes were proposed in this pull request?
Added a new `dropFields` method to the `Column` class. This method should allow users to drop a `StructField` in a `StructType` column (with similar semantics to the `drop` method on `Dataset`).
Why are the changes needed?
Often Spark users have to work with deeply nested data, e.g. to fix a data quality issue with an existing `StructField`. To do this with the existing Spark APIs, users have to rebuild the entire struct column.
For example, let's say you have the following deeply nested data structure which has a data quality issue (`5` is missing):
Currently, to drop the missing value users would have to do something like this:
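A minimal sketch of that rebuild approach, on a much smaller struct than the one discussed in the PR. The sample data, the field names `a`, `b`, `c`, and the object `RebuildStructSketch` are hypothetical and chosen only for illustration; here the nested field `a.b.b` is the null value to be dropped.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, struct}

object RebuildStructSketch {
  // Hypothetical sample: struct column `a` containing a nested struct `a.b`
  // whose field `b` holds a null (the "missing" value).
  def sample(spark: SparkSession): DataFrame = spark.sql(
    """SELECT named_struct(
      |  'a', 1,
      |  'b', named_struct('a', 4, 'b', CAST(NULL AS INT), 'c', 6),
      |  'c', 13) AS a""".stripMargin)

  // Without dropFields, every field that should survive has to be listed
  // again at every nesting level, including the ones that do not change.
  def dropByRebuilding(df: DataFrame): DataFrame =
    df.withColumn("a", struct(
      col("a.a").as("a"),
      struct(
        col("a.b.a").as("a"),
        col("a.b.c").as("c") // `a.b.b` is simply left out
      ).as("b"),
      col("a.c").as("c")
    ))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("rebuild-struct-sketch")
      .getOrCreate()
    try dropByRebuilding(sample(spark)).printSchema()
    finally spark.stop()
  }
}
```

Even in this small case, dropping one field means naming the four fields that stay; the amount of code grows with the width and depth of the struct.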
As you can see above, with the existing methods users must call the `struct` function and list all fields, including fields they don't want to change. This is not ideal as:
In contrast, with the method added in this PR, a user could simply do something like this to get the same result:
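A minimal sketch of the `dropFields` approach, assuming a Spark version in which both `Column.dropFields` and `Column.withField` are available (Spark 3.1 or later). The sample data, the field names, and the object `DropFieldsSketch` are hypothetical; this is an illustration of the idea, not the snippet from the PR.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object DropFieldsSketch {
  // Hypothetical sample: struct column `a` containing a nested struct `a.b`
  // whose field `b` holds a null (the "missing" value).
  def sample(spark: SparkSession): DataFrame = spark.sql(
    """SELECT named_struct(
      |  'a', 1,
      |  'b', named_struct('a', 4, 'b', CAST(NULL AS INT), 'c', 6),
      |  'c', 13) AS a""".stripMargin)

  // Replace `a.b` with a copy of itself that no longer contains field `b`.
  // Only the field being dropped is named; all other fields are untouched.
  def dropNested(df: DataFrame): DataFrame =
    df.withColumn("a", col("a").withField("b", col("a.b").dropFields("b")))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("drop-fields-sketch")
      .getOrCreate()
    try dropNested(sample(spark)).printSchema()
    finally spark.stop()
  }
}
```

The outer `withField` call is needed in this sketch because `dropFields` is applied to the inner struct `a.b`, and the result then has to be written back into `a`.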
This is the second of maybe 3 methods that could be added to the `Column` class to make it easier to manipulate nested data. Other methods under discussion in SPARK-22231 include `withFieldRenamed`. However, this should be added in a separate PR.
Does this PR introduce any user-facing change?
Only one minor change. If the user submits the following query:
instead of throwing:
it will now throw:
I don't believe it should be an issue to change this because:
but please feel free to correct me if I am wrong.
How was this patch tested?
New unit tests were added. Jenkins must pass them.
Related JIRAs:
More discussion on this topic can be found here: